Optimizing strategy for the discovery of compositionally-biased or low-complexity regions in proteins

Proteins can contain tracts that are dominated by a subset of amino acids and that have functional significance. These are often termed ‘low-complexity regions’ (LCRs) or ‘compositionally-biased regions’ (CBRs). However, a wide spectrum of compositional bias is possible, and the program parameters used to annotate these regions are often arbitrarily chosen. Also, investigators are sometimes interested in longer regions and sometimes in very short ones. Here, two programs for annotating LCRs/CBRs, namely SEG and fLPS, are investigated in detail across the whole expanse of their parameter spaces. In doing so, boundary behaviours are resolved that are then used to derive an optimized, systematic strategy for annotating LCRs/CBRs. Sets of parameters have been derived that progressively annotate or ‘cover’ more of protein sequence space and that are optimized for a given target length. This progressive annotation can be applied to discern the biological relevance of CBRs, e.g., in parsing domains for experimental constructs and in generating hypotheses. It is also useful for picking out candidate regions of interest of a given target length and bias signature, and for assessing the parameter dependence of annotations. This latter application is demonstrated for a set of human intrinsically-disordered proteins associated with cancer.


Running the SEG and fLPS programs
The two programs SEG and fLPS were investigated across a thorough sample of their parameter spaces. Both programs use three main parameters for CBR discovery, but extract fundamentally different information from the sequences. Figure 1 details both algorithms. The fLPS program (Fig. 1A) works through a process of binomial probability minimization 14. A binomial P-value can be calculated for the amino-acid biases of any sequence tract. SEG uses sequence entropy to search for CBRs/LCRs (Fig. 1B).
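As a rough illustration of the binomial P-value mentioned above, the following Python sketch computes the probability of seeing at least the observed count of one residue type in a tract. It is not the fLPS implementation: the function names are hypothetical, only single-residue bias is handled, and the background frequency of 0.05 is an assumption that mirrors the 'equal' setting of the fLPS -c option.

```python
from math import comb


def binomial_tail(n: int, k: int, f: float) -> float:
    """P(X >= k) for X ~ Binomial(n, f).

    n: tract length, k: observed count of the residue,
    f: assumed background frequency of that residue.
    """
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


def single_residue_pvalue(tract: str, residue: str, f: float = 0.05) -> float:
    """Binomial P-value for the bias of one residue type in a tract (hypothetical helper)."""
    return binomial_tail(len(tract), tract.count(residue), f)


if __name__ == "__main__":
    # A pure poly-Q tract is far less probable than a mixed Q-rich tract.
    print(single_residue_pvalue("QQQQQQQQQQ", "Q"))
    print(single_residue_pvalue("QQAQKQLQEQ", "Q"))
```

A lower P-value indicates a stronger bias, which is why the search described for fLPS is framed as probability minimization.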

Figure 1. Schematics of the algorithms. (A) fLPS algorithm. The three main parameters are maximum and minimum window lengths M and m, and an output P-value threshold t. Window lengths down from M to m are searched for single-residue biases with P-values less than a fixed high threshold (= 0.001 here). These are then used to form contigs out of which lowest-probability subsequences (LPSs) are calculated (multiple LPSs from the same contig are possible). Then, multiple-residue biased regions are tested for, which includes trimming or extending to obtain the multiple-residue LPS. Finally, the output is filtered with t, the threshold P-value. (B) SEG algorithm. Regions with low sequence entropy are more 'ordered' since they are dominated by a few of the possible amino-acid residue types. SEG works by scanning along sequences for windows of length L that have sequence entropy ≤ K1, a trigger threshold. Then, these 'trigger windows' are extended with further windows whose sequence entropy is ≤ K2, the extension threshold, to form a contig. SEG LPSs are then calculated from these contigs using recursion.
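The first stage of the SEG scheme in panel (B) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the SEG code itself: it uses Shannon entropy in bits over the residue composition of each window, the function names are hypothetical, and the extension (K2) and recursive LPS steps are omitted.

```python
from collections import Counter
from math import log2


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * log2(c / n) for c in Counter(window).values())


def trigger_windows(seq: str, L: int, K1: float) -> list[int]:
    """Start positions of length-L windows whose entropy is <= K1 (trigger windows)."""
    return [
        i
        for i in range(len(seq) - L + 1)
        if shannon_entropy(seq[i : i + L]) <= K1
    ]


if __name__ == "__main__":
    # Toy sequence: a poly-Q tract followed by a compositionally diverse tract.
    seq = "QQQQQQQQQQQQACDEFGHIKLMN"
    print(trigger_windows(seq, L=12, K1=2.2))
```

With 20 equally frequent residue types the entropy reaches its maximum of log2(20), about 4.3 bits, which bounds the useful range of K1 and K2.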
The parameter spaces searched are tabulated (Table 1). For fLPS, default background 'domains' amino-acid frequencies were used. These were updated using the downloaded ASTRAL sequences (however, the frequencies have changed minimally, indicating likely convergence). For both programs, there are recommended parameter sets to discover shorter low-complexity regions (Table 1). Also, there are SEG parameter sets to label longer biased regions, such as those made of longer tandem repeats, whose repetitiveness is only captured by longer L windows. The default fLPS settings are intended as a 'catch-all' set of very loose parameters that informs users of all CBRs in their sequences, even very mild short tracts. After a default run, the user is expected to home in on biases of interest with smaller t P-value thresholds. For example, t = 1e-10 was used to study prion-like proteins, since 1e-10 was the highest P-value observed for known prion-forming protein tracts (Table 1).

Metrics for assessing region discovery
To assess and compare parameter set performance, three metrics were derived: coverage (Cov), median (Med), and interquartile range (IQR) (Fig. 2).
Coverage (Cov) is the proportion of a protein sequence set that is annotated. More liberal parameters lead to greater coverage and discern a larger number of regions, including more mildly-biased ones. Cov is calculated taking account of any overlap between annotated regions, so residues that appear in multiple annotated regions are counted only once. Median region length (Med) and interquartile range (IQR) are also calculated from the distribution of region lengths (Fig. 2). IQR is an indicator of region length diversity, with smaller IQR values indicating more limited variation in region lengths.
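The three metrics can be computed from a set of annotated regions roughly as follows. This is a minimal sketch, assuming regions are given as 0-based, end-exclusive (start, end) pairs per protein; the data layout and function names are hypothetical, and the quartile convention (the 'inclusive' method of Python's statistics.quantiles) is an assumption, since the text does not specify one.

```python
from statistics import median, quantiles


def coverage(regions_by_protein: dict, lengths: dict) -> float:
    """Fraction of all residues annotated, counting overlapping residues once."""
    covered = 0
    for regions in regions_by_protein.values():
        seen = set()
        for start, end in regions:
            seen.update(range(start, end))  # overlap collapses in the set
        covered += len(seen)
    return covered / sum(lengths.values())


def length_metrics(regions_by_protein: dict) -> tuple:
    """Median region length (Med) and interquartile range (IQR) of region lengths."""
    lens = [end - start for regs in regions_by_protein.values() for start, end in regs]
    q1, _, q3 = quantiles(lens, n=4, method="inclusive")
    return median(lens), q3 - q1


if __name__ == "__main__":
    # Hypothetical annotations: two overlapping regions in p1, one region in p2.
    regions = {"p1": [(0, 10), (5, 15)], "p2": [(0, 20)]}
    lengths = {"p1": 50, "p2": 50}
    med, iqr = length_metrics(regions)
    print(coverage(regions, lengths), med, iqr, iqr / med)
```

The ratio IQR/Med printed at the end is the quantity plotted against Med in the parameter-space analysis.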
These metrics behave consistently for the two programs and for two of the data sets, UniRef50 and the yeast proteome, with distinct behaviour only for the third data set, the ASTRAL domains, which is to be expected since it comprises structured regions only (Suppl. Table 1). IQR/Med and Cov are correlated in all cases, implying that increased coverage comes with an increased range of region sizes. For the UniRef50 data, random samples of a third of the size of the sample studied yield highly correlated values for Cov, Med and IQR for both programs (Pearson R² > 0.99), indicating sufficient sample size. The metrics are also highly correlated between the ASTRAL structural domain set and the UniRef50 sample and yeast proteome (Pearson R² > 0.94).

Deriving curves indicating optimal strategies for CBR/LCR annotation
The behaviour of parameter sets was probed using plots of IQR/Med versus Med, for intervals of Cov. The following Cov intervals were used since they have approximately equal numbers of points: 0.015-0.025 (~ 2%), 0.04-0.06 (~ 5%), 0.08-0.12 (~ 10%), 0.2-0.3 (~ 25%), 0.35-0.45 (~ 40%). Upper and lower boundaries for the point distributions were derived for each plot, with logarithmic equations almost universally giving the best fit (Fig. 2). Boundary points were defined as those that are extreme relative to all the points above or below them within a margin around the point along the Med axis; different margins (in the range 3-9) were tried, with 3 found to be optimal. Because of the characteristic banding on these plots, the margin was skewed to lower values for the lower boundary and to higher values for the upper boundary (e.g., x − 5 to x + 1 for the point x for the lower boundary). From the average of these two boundary equations, 'middle curves' were calculated (Fig. 2). The points nearest these boundary and middle curves (with a tolerance of ± 0.05 IQR/Med) were then analyzed for relationships between metrics and parameters. These relationships were also derived using appropriate line fitting, with characteristic power-law or straight-line relationships between parameters being discovered (discussed in detail in "Results and discussion"). These relationships are robust to missing points that are partitioned to different plots depending on the Cov intervals examined. They were analysed collectively to derive an optimal annotation protocol for LCRs/CBRs of a given target length.
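The boundary-point selection and curve averaging described above can be sketched as follows. This is one possible reading of the procedure rather than the authors' code: the function names are hypothetical, the upper-boundary margin (x − 1 to x + 5) is an assumed mirror image of the lower-boundary example, and the logarithmic form y = a·ln(x) + b is fitted by ordinary least squares.

```python
from math import log


def boundary_points(points, left, right, upper=False):
    """Points (med, ratio) that are extreme among all points with Med in [x-left, x+right]."""
    keep = []
    for x, y in points:
        neighbours = [y2 for x2, y2 in points if x - left <= x2 <= x + right]
        extreme = max(neighbours) if upper else min(neighbours)
        if y == extreme:
            keep.append((x, y))
    return keep


def fit_log(points):
    """Least-squares fit of y = a*ln(x) + b; returns (a, b)."""
    xs = [log(x) for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / sum((u - mx) ** 2 for u in xs)
    return a, my - a * mx


def mid_curve(lower, upper):
    """Average of two logarithmic curves, each given as (a, b)."""
    return tuple((lo + up) / 2 for lo, up in zip(lower, upper))


if __name__ == "__main__":
    # Hypothetical (Med, IQR/Med) points, one per parameter set.
    pts = [(10, 0.5), (10, 0.9), (12, 0.4), (20, 0.6), (20, 1.0)]
    lower = fit_log(boundary_points(pts, left=5, right=1))
    upper = fit_log(boundary_points(pts, left=1, right=5, upper=True))
    print(lower, upper, mid_curve(lower, upper))
```

Averaging the coefficients is equivalent to averaging the two curves pointwise, because both share the same functional form.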

Further fLPS parameters
There are extra fLPS parameters for: (i) expected amino-acid frequencies (-c option), (ii) initial search granularity (-z option) 15. The -c option can be: 'equal' (= 0.05 for each amino acid), 'domains' (frequencies from ASTRAL domains 28), or 'user' (from input sequences). Cov shows a clear trend for '-c', with all other parameters
Table 1. Parameter values analyzed for SEG and fLPS. *The minimum window m ≤ maximum window M. **The trigger information entropy threshold K1 is ≤ K2 for all runs. The maximum value for K1 and K2 in amino-acid sequences is log2(20) = 4.3. The recommended parameter sets are taken from the original reference for SEG 5. † Refs. 2, 22, 23, 25, 26.

How do 'recommended' parameter sets perform?
What is a low-complexity region (LCR)? What is a compositionally biased region (CBR)? These questions are typically answered by applying the recommended or default parameter sets of programs that annotate them. For example, often LCRs are defined through default application of SEG simply because researchers have always tended to define them that way. However, these default parameters have been quite arbitrarily chosen. Indeed, LCRs and CBRs exist on a spectrum of compositional bias, with LCRs generally shorter and more repetitive, but some cases may also be long (Fig. 3). The most extreme LCRs are, of course, homopeptides 29. Also, different parameters can yield widely differing results. Some 'recommended' or default parameter sets tend to annotate longer regions, others shorter ones (higher or lower median (Med) values in Table 2). Coverage (Cov) values vary widely, with default SEG annotating ~ 9% of proteins, but with alternative SEG parameters for longer regions covering > 3 times as many residues. fLPS parameter sets demonstrate a corresponding range of Med and Cov. Default 'catch-all' fLPS parameters yield high coverage (> 60%), since they are designed to comprehensively capture regions with a compositional perturbation; for these parameters, any remaining unannotated sequence regions can be considered 'high-complexity'. In general, annotations made by fLPS have a greater diversity of lengths (wider IQR) than those made by SEG.
The annotations made by Lee et al. are derived from an image-processing algorithm applied to dot plots 17. These annotations for budding yeast are short, with very low length diversity (Med = 13; IQR/Med = 0.385) and very low coverage (Cov = 2.3%).
Low-complexity or compositionally-biased sequence in structured protein domains is clearly rarer and less diverse lengthwise, regardless of the parameters chosen (Table 1).The ASTRAL set stands out as always having lower IQR/Med values, and having much lower Cov values generally.Thus, sequence complexity is higher at every resolution in the structured parts of proteins.

The recommended parameters located in parameter space
How do these default or recommended parameter sets compare to the rest of their parameter spaces? How are these parameter sets special? To gain answers to these questions, the plots of IQR/Med versus Med containing the data points for the recommended sets for short LCRs were examined (Fig. 4). Each point in such plots represents a set of parameters. Upper and lower boundaries and mid-lines were calculated as described in Methods. The mid-line trend indicates an 'average' or 'half-way' degree of length diversity. The default SEG parameters are situated very close to the mid-line trend, with a median length of 15 (Fig. 4A). Points near the lower bound can be considered the most narrowly focused around a particular median, and those near the upper bound the most spread out. A typical lower-boundary point just below the default SEG point is highlighted and indicates the typical solution for this extreme of focus, i.e., K1 trigger entropy threshold = K2 extension threshold. A similar style of lower-bound solution is indicated for fLPS, with m set equal to M (Fig. 4B).
www.nature.com/scientificreports/
The patterning of points on these plots morphs in a distinct way for either algorithm as Cov increases (Suppl. Fig. 1). However, at very high Cov values (> 0.7 for SEG and > 0.5 for fLPS) they converge to a similar scattering of points and a common limiting behaviour. For such high Cov, a lot of proteins are being annotated over their full lengths, and lower Med values (and concomitantly higher IQR values) simply arise as progressively smaller window lengths pull in annotations of shorter proteins.
Table 2. Metrics for recommended parameter sets for fLPS and SEG. *Used for identifying prion-like regions or compositionally biased dark matter 2. For prion-like composition, biases for Q and/or N residues are considered specifically. **Used for identifying prion-like regions 22,23
For both programs there are standard relationships between this Med target length and chosen parameters for the mid-line trends. All correlations for these trends are significant (P < 0.0001). For example, there are power-law relationships between window lengths (M for fLPS and L for SEG) and Med target lengths, with Pearson R² > 0.92 for fLPS and > 0.99 for SEG for all Cov ranges studied (Fig. 5A,B). An example of this is shown for SEG and the ~ 10% coverage interval (Fig. 5C). This power-law may thus be an inherent property of such window-based algorithms. Also, logarithmic relationships are standard for the second parameter that determines bias levels (namely the thresholds K2 for SEG with R² values > 0.95 and t for fLPS with R² values > 0.92). An example is shown in Fig. 5D.
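A power-law relationship of the kind described above can be fitted by linear regression on log-log axes and then used to pick a window length for a chosen target length. The sketch below is illustrative only: the function names are hypothetical, the data points are synthetic (generated from an arbitrary power law, not taken from Fig. 5), and the direction of the fit (window length as a function of Med) is an assumption made so that the result can be used for parameter choice.

```python
from math import exp, log


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**k on log-log axes; returns (c, k)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    k = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    return exp(my - k * mx), k


def window_for_target(c: float, k: float, target_med: float) -> float:
    """Window length predicted by the fitted power law for a target median length."""
    return c * target_med**k


if __name__ == "__main__":
    # Synthetic (Med, window length) pairs standing in for mid-line parameter sets.
    meds = [10, 20, 40, 80]
    windows = [2.0 * m**0.9 for m in meds]
    c, k = fit_power_law(meds, windows)
    print(c, k, window_for_target(c, k, 15))
```

The same regression applied to Med against an untransformed threshold, with only Med log-transformed, would give the logarithmic relationships reported for K2 and log(t).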
The lower-bound trends yield the narrowest focus around a particular target length. The solution for these trends is very simple. For fLPS, m = M, and M and log(t) are proportional to Med and to each other (R² values > 0.89). For SEG, K1 = K2, and K2 and the window length L are proportional to Med and to each other (R² values > 0.92).
However, parameter sets extracted from near the upper bound of these plots for SEG demonstrate that it is not meaningful to consider these solutions. They only exhibit a logarithmic correlation between Med and K2 (e.g., R² = 0.8 for the ~ 10% coverage interval), with both L and K1 being uncorrelated with Med, and there are no subsets with regular patterning of K1-K2, as are observed for the other trends (Fig. 5B). This implies that the value of the K1 trigger threshold is arbitrary, and the annotations thus have no biological meaning, since they are an arbitrary subset of what is possible. This highlights a key feature of the SEG algorithm to be aware of, in that the LCRs/CBRs it annotates are all based on a core region that can have lower sequence entropy than the rest of the LCR. For the fLPS algorithm, these upper bounds are meaningful, with maximum IQR given by m set equal to the lowest value (5), and with M and t correlated with target length (Med) (e.g., Med vs. M, R² = 0.98; Med vs. log(t), R² = 0.97 for the 10% Cov interval). However, in general, for simplicity, we decided not to use upper-bound trends in these analyses.
that yield short median lengths and higher coverage, or vice versa. The default SEG parameters re-emerge in the table (blue highlights); equivalent fLPS parameters are also highlighted.

Software
A pair of programs, fLPSparameters and SEGparameters, for choosing parameters to perform such analysis are available at: https://github.com/pmharrison/parameters. They allow for either a 'diverse' focus (the mid-line trend) or a 'narrow' focus (the lower boundary trend), i.e., with the least possible diversity of region lengths. The discussion below just uses the default 'diverse' focus, for greater simplicity.

An optimized strategy: progressive parsing of CBRs/LCRs annotated across multiple target lengths
So, what is an optimized strategy for annotating regions of a given target length? The best answer is to examine the results from all these progressive program runs (Fig. 6), and assess the biological relevance at each stage. Such an approach may be productive for large-scale bioinformatical analyses involving cross-referencing with other information about function. It may also be useful in directing the parsing of domains to fashion experimental constructs. Thus, what is a meaningfully defined LCR/CBR is determined by such progressive analysis within the relevant biological context for a specific protein under experimental study.
An example of applying these scaled parameter sets is shown (Fig. 7). Human prion protein PrP (UniProt accession P04156) is dissected with parameters for 'short' and 'long' target lengths (15 and 150 residues). PrP underlies mammalian prion diseases, through amyloid formation, and functions in copper metabolism and circadian control 31. For fLPS, the 'long' parameters annotate the protein's repetitive copper-binding tract, which converges to a maximum length for parameters with estimated coverage > ~ 10%. The same tract is also found by 'long' SEG parameters, but it is lengthened to include the A/G-rich tract that is transmembrane in some PrP isoforms, and is implicated in conversion to amyloid 32,33. This region is annotated separately by fLPS for the 'short' parameters, along with two other tracts that may be biologically significant. The 'short' SEG analysis evidences a slow, gradual filling-in of the whole sequence as estimated coverage rises (Fig. 7).
Further examples of this progressive CBR parsing show that some CBRs are detected regardless of target length, but some are only detected with short target length (Suppl. Fig. 2). For a human collagen (Suppl. Fig. 2B), short proline-rich tracts appear at target length = 15, which then expand into longer regions as estimated coverage is increased. In Saccharomyces cerevisiae MRN1 RNA-binding protein (Suppl. Fig. 2C), a putative intrinsically-disordered region up to residue 195 (predicted by AlphaFold, in UniProt entry Q08925) is parsed into distinct sub-regions. Arguably, in general, such results based on parsing CBRs according to compositional biases, rather than sequence entropy, are more meaningful biologically, since sequence entropy per se is less likely to be under selection than, say, a bias for glutamine or glutamate residues linked to specific functional roles.
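The progressive parsing discussed in this section amounts to asking, at each estimated coverage level, which residues are annotated for the first time. A minimal Python sketch of that bookkeeping is given below; the function name, the input layout (0-based, end-exclusive regions per level) and the coordinates are all hypothetical and are not taken from the PrP, collagen or MRN1 results.

```python
def progressive_layers(levels):
    """Residue positions first annotated at each level.

    levels: list of (label, [(start, end), ...]) ordered by increasing
    estimated coverage; regions are 0-based and end-exclusive.
    """
    seen = set()
    layers = {}
    for label, regions in levels:
        current = set()
        for start, end in regions:
            current.update(range(start, end))
        layers[label] = sorted(current - seen)  # new at this level only
        seen |= current
    return layers


if __name__ == "__main__":
    # Hypothetical annotations of one protein at three coverage levels.
    levels = [
        ("2%", [(50, 60)]),
        ("5%", [(45, 65)]),
        ("10%", [(45, 65), (100, 105)]),
    ]
    for label, positions in progressive_layers(levels).items():
        print(label, positions)
```

A core that appears at the lowest level and merely grows at its edges behaves differently from a tract that only appears at high coverage, and this layered view makes that distinction explicit.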

Application to searching for CBRs of a given target length
Parameter choice focused on CBR target length can also be used to pick out a data set of CBRs with a similar bias. This is illustrated for the M domain of Sup35 protein from S. cerevisiae, a domain which mediates pH sensing during reversible condensate formation in response to stress 34. A ~ 90-residue {KE}-rich CBR that corresponds to the M-domain was discovered using intermediate target lengths and > 5% estimated coverage (Suppl. Fig. 3A,B); the yeast proteome was then scanned for {KE}/{EK}-rich CBRs of target length = 90 with estimated coverage of 5% (Suppl. Fig. 3C). Significant Gene Ontology category enrichments were observed for these CBRs that are linked to rRNA and ribosomal processing (Suppl. Fig. 3C). Interestingly the Lee et al. annotations do not contain
Figure 6. Optimal parameters for a given target length. Considering the median Med to be a target length, parameters can be selected to aim at this target. These are listed for a 'short' target length (15 residues) and a longer one (150 residues). They are taken from the 'mid-line' analysis, which is termed a 'diverse' focus in the distributed software fLPSparameters and SEGparameters. The approximate coverage expected for each parameter set is listed in the middle.
Table 1 (contents).
fLPS*
  For shorter low-complexity regions: m = 5, M = 25, t = 1e-05; m = 5, M = 25, t = 1e-06
  To find most shorter low-complexity tracts, and longer compositionally biased regions (default), i.e., 'catch-all' default parameters: m = 15, M = 500, t = 1e-03
  Parameters used for identifying prion-like compositional biases or biased 'dark matter' †: m = 15, M = 500, t = 1e-10; m = 15, M = 100, t = 1e-10
  Ranges analyzed: 5 ≤ m ≤ 100, 5 ≤ M ≤ 1000, 1e-03 ≥ t ≥ 1e-12
SEG**
  For shorter low-complexity regions (default): L = 12, K1 = 2.2, K2 = 2.5
  For longer compositionally biased regions, such as those made of longer tandem repeats: L = 25, K1 = 3.0, K2 = 3.3; L = 45, K1 = 3.4, K2 = 3.75

Figure 2. Analyzing the algorithm parameter spaces. A set of protein sequences is analyzed for each parameter set to extract three metrics: (i) the coverage (Cov), which is the proportion of the protein data set annotated by the algorithm; (ii) the median length of annotated regions (Med); (iii) the interquartile range of the distribution of region lengths (IQR). Plots of IQR/Med versus Med are derived for intervals of Cov, and upper and lower boundary curves are fitted, then average 'mid-line' trends calculated. The parameter sets that yield the lower and mid-line bounds are extracted, as described in Methods.

Figure 3. Low-complexity regions (LCRs) can be considered a subset of compositionally biased regions (CBRs).Examples of LCRs and other CBRs are shown that were discovered by fLPS.

Figure 6 shows how program parameters scale as the implied coverage of annotations increases. Parameters for shorter and longer LCRs/CBRs clearly behave very differently. Also, we can see examples of parameter settings

Figure 5. Discovering trends in program parameters. Points near the trendlines in the IQR/Med versus Med plots are extracted and examined. For both programs (A) fLPS and (B) SEG, there are power-law relationships between window size and Med, and logarithmic relationships with Med for a second parameter that measures the degree of bias. Different sets of solutions for the third parameter (m for fLPS and K1 for SEG) then arise in the data. In (C) and (D), there are examples of plots of Med versus L and Med versus K2 for a coverage interval of ~ 10% (Cov 0.08-0.12).
https://doi.org/10.1038/s41598-023-50991-8

Figure 7. How fLPS and SEG can parse a protein progressively using short and long target lengths. The examples of 'short' (15-residue) and 'long' (150-residue) target lengths from Fig. 6 are employed on the human prion protein PrP as an example, using a 'diverse' focus. The sequence parts that add in at each level are coloured progressively. The definitions of LCRs from Lee et al. 17 are underlined.
, but applied generally to any biases here. ***Parameters for labelling longer CBRs such as those made from tandem repeats with repeat lengths longer than.